Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16837 container: Add client-side DFS metrics #15544

Merged
merged 5 commits into from
Dec 18, 2024
Merged

Conversation

mjmac
Copy link
Contributor

@mjmac mjmac commented Nov 27, 2024

If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Features: container telemetry
Signed-off-by: Michael MacDonald [email protected]

Copy link

github-actions bot commented Nov 27, 2024

Ticket title is 'Add DFS-level metrics to client telemetry'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-16837

@mjmac mjmac changed the title mjmac/DAOS 16837 DAOS-16837 container: Add client-side DFS metrics Nov 27, 2024
@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15544/1/testReport/

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15544/1/testReport/

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15544/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15544/3/execution/node/1131/log

@mjmac mjmac force-pushed the mjmac/DAOS-16837 branch 3 times, most recently from 768a10d to bc4444e Compare December 5, 2024 16:07
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15544/6/testReport/

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15544/7/testReport/

src/client/dfs/obj.c Outdated Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15544/8/execution/node/1506/log

@mjmac
Copy link
Contributor Author

mjmac commented Dec 9, 2024

@mchaarawi: I've removed the dependency on the new container property for now, as I'd prefer not to hold this patch up if that becomes a sticking point. Please take a look at let me know if there are any significant changes that you'd like. I can add some ftest coverage for the new DFS metrics if it looks OK.

@ashleypittman: I based this patch on the stats you added to dfuse. It's a bit heavier-weight because it relies on the telemetry library instead of atomics, but maybe there's an opportunity to merge the approaches?

@mjmac mjmac requested a review from mchaarawi December 9, 2024 23:32
Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor issues that need to be addressed

src/client/dfs/io.c Outdated Show resolved Hide resolved
src/client/dfs/io.c Show resolved Hide resolved
src/client/dfs/io.c Outdated Show resolved Hide resolved
src/client/dfs/obj.c Outdated Show resolved Hide resolved
src/client/dfs/obj.c Outdated Show resolved Hide resolved
src/client/dfs/obj.c Outdated Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15544/10/execution/node/1543/log

mchaarawi
mchaarawi previously approved these changes Dec 11, 2024
Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me.
It would be great if we can enable metrics for something like our existing ior_small.py and mdtest_small.py ci tests and do some verifications there. whether in this PR or other PR is fine with me.

src/client/dfs/io.c Outdated Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15544/12/execution/node/1211/log

@daltonbohning
Copy link
Contributor

I suspect the copyright GHA is failing because the workflow is coming from a merge of this PR + master, but the source tree used is just this PR.. Something I'll need to consider in the future for new GHA. Anyway, the copyright GHA is not required so can be ignored

If metrics are enabled for a POSIX container, create
a new container/$UUID/dfs metrics root in the client
telemetry to provide DFS-oriented metrics (POSIX ops,
file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Features: container telemetry
Required-githooks: true
Signed-off-by: Michael MacDonald <[email protected]>
Copy link
Contributor

@phender phender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to address the consider-using-generator pylint error.

@mjmac
Copy link
Contributor Author

mjmac commented Dec 16, 2024

Need to address the consider-using-generator pylint error.

Argh. Do you guys have a preferred solution for this? Seems like overkill compared to a list comprehension. Should I just add an ignore?

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
kjacque
kjacque previously approved these changes Dec 16, 2024
Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look as deeply into the DFS changes. Go and telemetry changes LGTM.

src/client/dfs/metrics.c Outdated Show resolved Hide resolved
Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
@daltonbohning
Copy link
Contributor

@mjmac Is this correct? Total "sample_count": 2, but all 4 buckets - 1048575, 2097151, 4194303, +Inf - have count 2?

2024-12-15 02:49:18,255 process          L0416 DEBUG| [stdout]             "labels": {
2024-12-15 02:49:18,255 process          L0416 DEBUG| [stdout]               "container": "6981025d-695a-4839-ac13-1e64bd7db15c",
2024-12-15 02:49:18,255 process          L0416 DEBUG| [stdout]               "jobid": "wolf-233vm3.wolf.hpdd.intel.com-84216",
2024-12-15 02:49:18,255 process          L0416 DEBUG| [stdout]               "pid": "84216"
2024-12-15 02:49:18,255 process          L0416 DEBUG| [stdout]             },
2024-12-15 02:49:18,255 process          L0416 DEBUG| [stdout]             "sample_count": 2,
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]             "sample_sum": 1048576,
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]             "buckets": [
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]                 "upper_bound": 255
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,256 process          L0416 DEBUG| [stdout]                 "upper_bound": 511
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]                 "upper_bound": 1023
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]                 "upper_bound": 2047
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,257 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]                 "upper_bound": 4095
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]                 "upper_bound": 8191
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,258 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]                 "upper_bound": 16383
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]                 "upper_bound": 32767
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,259 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]                 "upper_bound": 65535
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]                 "upper_bound": 131071
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,260 process          L0416 DEBUG| [stdout]                 "upper_bound": 262143
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]                 "cumulative_count": 0,
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]                 "upper_bound": 524287
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]                 "cumulative_count": 2,
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]                 "upper_bound": 1048575
2024-12-15 02:49:18,261 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "cumulative_count": 2,
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "upper_bound": 2097151
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "cumulative_count": 2,
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "upper_bound": 4194303
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]               },
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]               {
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "label": "",
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "cumulative_count": 2,
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]                 "upper_bound": "+Inf"
2024-12-15 02:49:18,262 process          L0416 DEBUG| [stdout]               }
2024-12-15 02:49:18,263 process          L0416 DEBUG| [stdout]             ]
2024-12-15 02:49:18,263 process          L0416 DEBUG| [stdout]           },

@mjmac
Copy link
Contributor Author

mjmac commented Dec 16, 2024

@mjmac Is this correct? Total "sample_count": 2, but all 4 buckets - 1048575, 2097151, 4194303, +Inf - have count 2?

Yep. It's a Cumulative Histogram. That's the format that Prometheus likes.

@daltonbohning
Copy link
Contributor

@mjmac Is this correct? Total "sample_count": 2, but all 4 buckets - 1048575, 2097151, 4194303, +Inf - have count 2?

Yep. It's a Cumulative Histogram. That's the format that Prometheus likes.

Interesting :) I don't think I've seen that before

@mjmac
Copy link
Contributor Author

mjmac commented Dec 17, 2024

@phender: Nudge... Please advise on how you want me to address your -1 review.

Copy link
Contributor

@phender phender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving ftest changes.

@mjmac mjmac requested review from mchaarawi and kjacque December 17, 2024 20:16
@mjmac
Copy link
Contributor Author

mjmac commented Dec 17, 2024

@mchaarawi, @kjacque: Should be good to go now.

@mjmac mjmac merged commit ea7f05d into master Dec 18, 2024
58 checks passed
@mjmac mjmac deleted the mjmac/DAOS-16837 branch December 18, 2024 19:14
mjmac added a commit that referenced this pull request Dec 18, 2024
If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Dec 20, 2024
If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Dec 31, 2024
If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Jan 9, 2025
If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Jan 14, 2025
If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Jan 14, 2025
If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Jan 15, 2025
If metrics are enabled for a POSIX container, create
a new pool/$UUID/container/$UUID/dfs metrics root in
the client telemetry to provide DFS-oriented metrics
(POSIX ops, file I/Os, etc).

Also fixes a bug in the agent code for pruning unused
client telemetry segments.

Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Jan 17, 2025
DAOS-16837 container: Add client-side DFS metrics (#15544)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants